content encoder
DarkStream: Real-Time Speech Anonymization with Low Latency
Quamer, Waris, Gutierrez-Osuna, Ricardo
Abstract--We propose DarkStream, a streaming speech synthesis model for real-time speaker anonymization. To improve content encoding under strict latency constraints, DarkStream combines a causal waveform encoder, a short lookahead buffer, and transformer-based contextual layers. To further reduce inference time, the model generates waveforms directly via a neural vocoder, thus removing intermediate mel-spectrogram conversions. Evaluations show our model achieves strong anonymization, yielding close to 50% speaker verification EER (near-chance performance) in the lazy-informed attack scenario, while maintaining acceptable linguistic intelligibility (WER within 9%). By balancing low latency, robust privacy, and minimal intelligibility degradation, DarkStream provides a practical solution for privacy-preserving real-time speech communication.

Voice recordings contain rich biometric information that reveals not only linguistic content but also personal attributes such as speaker identity, sex, and age, as well as paralinguistics (dialect/accent, emotions). Such sensitive information can be exploited by adversaries for speaker recognition and profiling, raising significant privacy concerns.
- Information Technology > Security & Privacy (1.00)
- Information Technology > Artificial Intelligence > Speech > Speech Recognition (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.34)
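The latency-control mechanism in the DarkStream abstract above (a causal encoder fed chunk by chunk, with a short lookahead) can be illustrated with a minimal sketch. This is not the authors' code; `CausalEncoder`, the chunk size, and the lookahead length are illustrative placeholders.

```python
# Minimal sketch (not the authors' code): chunked streaming through a causal
# encoder with a short lookahead buffer. Sizes are illustrative placeholders.
import torch
import torch.nn as nn

class CausalEncoder(nn.Module):
    """Causal 1-D conv stack: left-padding ensures no future samples are used."""
    def __init__(self, dim=64, kernel=5):
        super().__init__()
        self.pad = kernel - 1
        self.conv = nn.Conv1d(1, dim, kernel)

    def forward(self, x):                          # x: (batch, 1, samples)
        return self.conv(nn.functional.pad(x, (self.pad, 0)))

def stream(wave, encoder, chunk=320, lookahead=80):
    """Encode `wave` chunk by chunk; each chunk peeks `lookahead` samples
    ahead, so algorithmic latency is bounded by chunk + lookahead."""
    outputs = []
    for start in range(0, wave.size(-1) - lookahead, chunk):
        end = min(start + chunk + lookahead, wave.size(-1))
        feats = encoder(wave[..., start:end])
        outputs.append(feats[..., :chunk])         # emit only the settled frames
    return torch.cat(outputs, dim=-1)

out = stream(torch.randn(1, 1, 16000), CausalEncoder())
```

For clarity each chunk is encoded independently here; a real streaming implementation would carry convolution state across chunks, and DarkStream's transformer contextual layers and neural vocoder are omitted entirely.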
FasterVoiceGrad: Faster One-step Diffusion-Based Voice Conversion with Adversarial Diffusion Conversion Distillation
Kaneko, Takuhiro, Kameoka, Hirokazu, Tanaka, Kou, Kondo, Yuto
A diffusion-based voice conversion (VC) model (e.g., VoiceGrad) can achieve high speech quality and speaker similarity; however, its conversion process is slow owing to iterative sampling. FastVoiceGrad overcomes this limitation by distilling VoiceGrad into a one-step diffusion model. However, it still requires a computationally intensive content encoder to disentangle the speaker's identity and content, which slows conversion. Therefore, we propose FasterVoiceGrad, a novel one-step diffusion-based VC model obtained by simultaneously distilling a diffusion model and content encoder using adversarial diffusion conversion distillation (ADCD), where distillation is performed in the conversion process while leveraging adversarial and score distillation training. Experimental evaluations of one-shot VC demonstrated that FasterVoiceGrad achieves VC performance competitive with FastVoiceGrad while running 6.6-6.9 times faster on a GPU and 1.8 times faster on a CPU.
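A rough, heavily simplified sketch of the distillation recipe named in the abstract: a one-step student matches samples from the slow iterative teacher while a discriminator enforces realism. All module names are placeholders, and the exact ADCD formulation (score distillation, joint content-encoder distillation) is not reproduced.

```python
# Rough sketch (placeholders, not the paper's ADCD): one-step distillation of
# an iterative diffusion VC teacher with an added adversarial term.
import torch
import torch.nn.functional as F

def distill_step(student, disc, teacher_sample, mel_src, spk_tgt, opt_g, opt_d):
    with torch.no_grad():
        target = teacher_sample(mel_src, spk_tgt)   # slow iterative sampler

    noise = torch.randn_like(mel_src)
    fake = student(noise, mel_src, spk_tgt)         # single denoising step

    # discriminator update (hinge loss)
    d_loss = (F.relu(1 - disc(target)).mean()
              + F.relu(1 + disc(fake.detach())).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # student update: match the teacher, plus adversarial realism
    g_loss = F.l1_loss(fake, target) - disc(fake).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return g_loss.item(), d_loss.item()
```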
Streaming Non-Autoregressive Model for Accent Conversion and Pronunciation Improvement
Nguyen, Tuan-Nam, Pham, Ngoc-Quan, Akti, Seymanur, Waibel, Alexander
We propose the first streaming accent conversion (AC) model, which transforms non-native speech into a native-like accent while preserving speaker identity and prosody and improving pronunciation. Our approach enables stream processing by modifying a previous AC architecture with an Emformer encoder and an optimized inference mechanism. Additionally, we integrate a native text-to-speech (TTS) model to generate ideal ground-truth data for efficient training. Our streaming AC model achieves comparable performance to the top AC models while maintaining stable latency, making it the first AC system capable of streaming.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States (0.04)
- Europe > Germany > Baden-Württemberg > Karlsruhe Region > Karlsruhe (0.04)
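The Emformer encoder this abstract builds on is available in torchaudio, so the chunkwise streaming pattern can be sketched directly. All dimensions and the segment/context lengths below are illustrative, not the paper's configuration.

```python
# Sketch of chunkwise streaming with torchaudio's Emformer; sizes are
# illustrative, not the paper's configuration.
import torch
from torchaudio.models import Emformer

enc = Emformer(input_dim=80, num_heads=4, ffn_dim=512, num_layers=4,
               segment_length=16, left_context_length=8, right_context_length=4)
enc.eval()

feats = torch.randn(1, 200, 80)                     # incoming log-mel frames
seg, right = 16, 4
states, stream_out = None, []
with torch.no_grad():
    for t in range(0, feats.size(1), seg):
        chunk = feats[:, t : t + seg + right]       # segment plus lookahead
        if chunk.size(1) < seg + right:             # skip the ragged tail
            break
        lengths = torch.tensor([chunk.size(1)])
        out, _, states = enc.infer(chunk, lengths, states)
        stream_out.append(out)                      # one segment per step
```

The `right_context_length` frames are the lookahead that bounds latency; `states` carries the left context between calls so each step only sees one new segment.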
Stepback: Enhanced Disentanglement for Voice Conversion via Multi-Task Learning
Voice conversion (VC) modifies voice characteristics while preserving linguistic content. This paper presents the Stepback network, a novel model for converting speaker identity using non-parallel data. Unlike traditional VC methods that rely on parallel data, our approach leverages deep learning techniques to enhance disentanglement completion and linguistic content preservation.

VAEs consist of two main parts: a content encoder and a decoder. The content encoder processes source speech, transforms it into a latent representation, and removes speaker information. The decoder takes the speaker identity, combines it with the latent representation, and reconstructs the speech [5]. A notable VAE approach is disentangling speaker and content representations using instance normalization.
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.14)
- Asia > China > Shanghai > Shanghai (0.04)
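The instance-normalization trick mentioned at the end of the abstract can be sketched in a few lines: instance norm over time removes per-utterance channel statistics, which carry much of the speaker identity, and the decoder re-injects a speaker embedding. The architecture below is an illustrative assumption (in the spirit of AdaIN-VC-style disentanglement), not the Stepback network itself.

```python
# Illustrative sketch of instance-norm disentanglement, not the Stepback network.
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    def __init__(self, n_mels=80, dim=128):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, dim, kernel_size=5, padding=2)
        self.inorm = nn.InstanceNorm1d(dim)   # strips per-utterance channel stats

    def forward(self, mel):                   # mel: (batch, n_mels, frames)
        return self.inorm(torch.relu(self.conv(mel)))

class Decoder(nn.Module):
    def __init__(self, dim=128, spk_dim=64, n_mels=80):
        super().__init__()
        self.proj = nn.Linear(spk_dim, dim)   # re-inject speaker identity
        self.out = nn.Conv1d(dim, n_mels, kernel_size=5, padding=2)

    def forward(self, content, spk):          # spk: (batch, spk_dim)
        return self.out(content + self.proj(spk).unsqueeze(-1))

recon = Decoder()(ContentEncoder()(torch.randn(2, 80, 100)), torch.randn(2, 64))
```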
JOOCI: a Framework for Learning Comprehensive Speech Representations
Yadav, Hemant, Shah, Rajiv Ratn, Sitaram, Sunayana
Information in speech can be divided into two categories: "what is being said" (content) and "how it is expressed" (other). Current state-of-the-art (SOTA) techniques model speech at fixed segments, usually 10-25 ms, using a single embedding. Given the orthogonal nature of the "other" and "content" information, attempting to optimize both within a single embedding results in suboptimal solutions. This approach divides the model's capacity, limiting its ability to build complex hierarchical features effectively. In this work, we present an end-to-end speech representation learning framework designed to jointly optimize the "other" and "content" information (JOOCI) in speech. Our results show that JOOCI consistently outperforms other SOTA models of similar size (100 million parameters) and pre-training data (960 hours) by a significant margin when evaluated on a range of speech downstream tasks in the SUPERB benchmark, as shown in Table 1. Code and models are available at TBA.

Self-supervised learning (SSL) has played a significant role in learning high-level representations of text (Brown et al., 2020), vision (Alexey, 2020), and audio (Baevski et al., 2020; Mohamed et al., 2022; Défossez et al., 2022) data. In this work, we focus on learning high-level representations from raw speech.
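The core JOOCI premise, that "content" and "other" information deserve separate capacity rather than one shared embedding, can be sketched as two parallel encoders with task-specific heads. All sizes, targets, and heads below are assumptions for illustration, not the paper's architecture.

```python
# Illustrative sketch of separate capacity for "content" and "other" streams.
import torch
import torch.nn as nn

class JointModel(nn.Module):
    def __init__(self, feat_dim=512, dim=256, n_phones=40, n_speakers=251):
        super().__init__()
        self.content_enc = nn.GRU(feat_dim, dim, batch_first=True)  # "what is said"
        self.other_enc = nn.GRU(feat_dim, dim, batch_first=True)    # "how it is said"
        self.phone_head = nn.Linear(dim, n_phones)
        self.spk_head = nn.Linear(dim, n_speakers)

    def forward(self, feats):                 # feats: (batch, frames, feat_dim)
        content, _ = self.content_enc(feats)
        other, _ = self.other_enc(feats)
        # frame-level content predictions; utterance-level "other" prediction
        return self.phone_head(content), self.spk_head(other.mean(dim=1))

phone_logits, spk_logits = JointModel()(torch.randn(2, 100, 512))
```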
UNIT-DSR: Dysarthric Speech Reconstruction System Using Speech Unit Normalization
Wang, Yuejiao, Wu, Xixin, Wang, Disong, Meng, Lingwei, Meng, Helen
Dysarthric speech reconstruction (DSR) systems aim to automatically convert dysarthric speech into normal-sounding speech. The technology eases communication with speakers affected by this neuromotor disorder and enhances their social inclusion. NED-based (Neural Encoder-Decoder) systems have significantly improved the intelligibility of the reconstructed speech as compared with GAN-based (Generative Adversarial Network) approaches, but the approach is still limited by the training inefficiency caused by its cascaded pipeline and the auxiliary tasks of the content encoder, which may in turn affect the quality of reconstruction. Inspired by self-supervised speech representation learning and discrete speech units, we propose a Unit-DSR system, which harnesses the powerful domain-adaptation capacity of HuBERT for training efficiency improvement and utilizes speech units to constrain the dysarthric content restoration in a discrete linguistic space. Compared with NED approaches, the Unit-DSR system only consists of a speech unit normalizer and a Unit HiFi-GAN vocoder, which is considerably simpler without cascaded sub-modules or auxiliary tasks. Results on the UASpeech corpus indicate that Unit-DSR outperforms competitive baselines in terms of content restoration, reaching a 28.2% relative average word error rate reduction when compared to original dysarthric speech, and shows robustness against speed perturbation and noise.
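The discrete speech units that Unit-DSR builds on can be sketched with off-the-shelf components: HuBERT features quantized by k-means. The layer index and codebook size below are common choices, not necessarily the paper's, and the unit normalizer and Unit HiFi-GAN vocoder are omitted.

```python
# Sketch of discrete speech units: HuBERT features quantized with k-means.
import torch
import torchaudio
from sklearn.cluster import KMeans

model = torchaudio.pipelines.HUBERT_BASE.get_model().eval()

wave = torch.randn(1, 48000)                  # 3 s of placeholder audio at 16 kHz
with torch.no_grad():
    feats, _ = model.extract_features(wave)   # list of per-layer features
layer6 = feats[5].squeeze(0)                  # (frames, 768)

# in practice k-means is fit on a large corpus; fitting on one utterance
# here only keeps the example self-contained
km = KMeans(n_clusters=100, n_init=10).fit(layer6.numpy())
units = km.labels_                            # the discrete unit sequence
```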
StreamVC: Real-Time Low-Latency Voice Conversion
Yang, Yang, Kartynnik, Yury, Li, Yunpeng, Tang, Jiuqiang, Li, Xing, Sung, George, Grundmann, Matthias
We present StreamVC, a streaming voice conversion solution that preserves the content and prosody of any source speech while matching the voice timbre from any target speech. Unlike previous approaches, StreamVC produces the resulting waveform at low latency from the input signal even on a mobile platform, making it applicable to real-time communication scenarios like calls and video conferencing, and addressing use cases such as voice anonymization in these scenarios. Our design leverages the architecture and training strategy of the SoundStream neural audio codec for lightweight high-quality speech synthesis. We demonstrate the feasibility of learning soft speech units causally, as well as the effectiveness of supplying whitened fundamental frequency information to improve pitch stability without leaking the source timbre information.
- North America > United States (0.04)
- Europe > Italy > Calabria > Catanzaro Province > Catanzaro (0.04)
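The "whitened fundamental frequency" idea in the StreamVC abstract above admits a short sketch: normalize the log-F0 track so the model receives the pitch contour but not the speaker's absolute pitch level. The F0 extractor and the per-utterance statistics below are stand-ins; a streaming system would use running (causal) statistics instead.

```python
# Sketch of "whitened F0": pass the pitch contour, not the absolute level.
import torch
import torchaudio.functional as AF

def whitened_f0(wave, sample_rate=16000):
    f0 = AF.detect_pitch_frequency(wave, sample_rate)      # coarse F0 track
    logf0 = torch.log(f0.clamp(min=1e-5))
    # zero mean / unit variance: the contour survives, the level does not
    return (logf0 - logf0.mean()) / logf0.std().clamp(min=1e-5)

contour = whitened_f0(torch.randn(1, 16000))
```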
SAIC: Integration of Speech Anonymization and Identity Classification
Cheng, Ming, Diao, Xingjian, Cheng, Shitong, Liu, Wenjun
Speech anonymization and de-identification have garnered significant attention recently, especially in the healthcare area, including telehealth consultations, patient voiceprint matching, and patient real-time monitoring. Speaker identity classification tasks, which involve recognizing specific speakers from audio to learn identity features, are crucial for de-identification. Since few studies have effectively combined speech anonymization with identity classification, we propose SAIC, an innovative pipeline for integrating Speech Anonymization and Identity Classification. SAIC demonstrates remarkable performance, reaching state of the art on the speaker identity classification task on the VoxCeleb1 dataset with a top-1 accuracy of 96.1%. Although SAIC is not trained or evaluated specifically on clinical data, the results strongly support the model's effectiveness and its potential to generalize to the healthcare domain, providing insightful guidance for future work.
AutoCycle-VC: Towards Bottleneck-Independent Zero-Shot Cross-Lingual Voice Conversion
Choi, Haeyun, Gim, Jio, Lee, Yuho, Kim, Youngin, Suh, Young-Joo
This paper proposes a simple and robust zero-shot voice conversion system with a cycle structure and mel-spectrogram pre-processing. Previous works suffer from information loss and poor synthesis quality due to their reliance on a carefully designed bottleneck structure. Moreover, models relying solely on self-reconstruction loss struggle to reproduce different speakers' voices. To address these issues, we propose a cycle-consistency loss that considers conversion back and forth between target and source speakers. Additionally, stacked random-shuffled mel-spectrograms and a label smoothing method are utilized during speaker encoder training to extract a time-independent global speaker representation from speech, which is the key to zero-shot conversion. Our model outperforms existing state-of-the-art models in both subjective and objective evaluations. Furthermore, it facilitates cross-lingual voice conversions and enhances the quality of synthesized speech.
- Asia > South Korea > Gyeongsangbuk-do > Pohang (0.04)
- Asia > India (0.04)
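The cycle-consistency loss described in the AutoCycle-VC abstract above admits a compact sketch: convert source speech to the target speaker, convert it back, and penalize the round-trip error. `convert` is a placeholder for the full conversion model.

```python
# Compact sketch of a cycle-consistency loss for voice conversion.
import torch.nn.functional as F

def cycle_loss(convert, mel_src, spk_src, spk_tgt):
    converted = convert(mel_src, spk_tgt)   # source content, target voice
    cycled = convert(converted, spk_src)    # and back to the source speaker
    return F.l1_loss(cycled, mel_src)       # the round trip should be lossless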
Emotion Embeddings: Learning Stable and Homogeneous Abstractions from Heterogeneous Affective Datasets
Human emotion is expressed in many communication modalities and media formats, and so its computational study is equally diversified into natural language processing, audio signal analysis, computer vision, etc. Similarly, the large variety of representation formats used in previous research to describe emotions (polarity scales, basic emotion categories, dimensional approaches, appraisal theory, etc.) has led to an ever-proliferating diversity of datasets, predictive models, and software tools for emotion analysis. Because of these two distinct types of heterogeneity, at the expressional and representational level, there is a dire need to unify previous work on increasingly diverging data and label types. This article presents such a unifying computational model. We propose a training procedure that learns a shared latent representation for emotions, so-called emotion embeddings, independent of different natural languages, communication modalities, media or representation label formats, and even disparate model architectures. Experiments on a wide range of heterogeneous affective datasets indicate that this approach yields the desired interoperability for the sake of reusability, interpretability and flexibility, without penalizing prediction quality. Code and data are archived under https://doi.org/10.5281/zenodo.7405327.
- North America > United States > Minnesota > Hennepin County > Minneapolis (0.14)
- North America > United States > California > San Francisco County > San Francisco (0.14)
- North America > United States > New Mexico > Santa Fe County > Santa Fe (0.04)
- (33 more...)
- Media > Music (1.00)
- Leisure & Entertainment (1.00)
- Health & Medicine (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Cognitive Science > Emotion (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.92)
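The unification idea in the emotion-embeddings abstract above, one shared embedding space behind modality-specific encoders and label-format-specific heads, can be sketched as follows. All dimensions and head types are illustrative assumptions, not the paper's setup.

```python
# Illustrative sketch: heterogeneous encoders and label heads routed through
# one shared emotion-embedding space.
import torch
import torch.nn as nn

shared_dim = 128
text_enc = nn.Linear(768, shared_dim)      # e.g., on top of a text encoder
audio_enc = nn.Linear(1024, shared_dim)    # e.g., on top of a speech encoder

heads = nn.ModuleDict({
    "polarity": nn.Linear(shared_dim, 1),  # scalar polarity scale
    "basic6": nn.Linear(shared_dim, 6),    # basic emotion categories
    "vad": nn.Linear(shared_dim, 3),       # valence / arousal / dominance
})

emb = text_enc(torch.randn(4, 768))        # shared emotion embedding
vad_pred = heads["vad"](emb)               # any head can decode any modality
```

Because every encoder targets the same space, a head trained on one dataset's label format can decode embeddings produced from another dataset or modality, which is the interoperability the abstract claims.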